Demonstration Video


Introduction

Our design is a Raspberry Pi-based intelligent assistant that can freestyle rap about a topic given by the user's voice command. It automatically detects its trigger word (its name, "Andrew"), transcribes the voice collected from the microphone into text, understands the topic from the natural-language sentence, generates related lyrics and a background beat, and finally plays them through the speaker. We also implemented a screen display that shows the dialogue content with signal waves as the background, and we extended the dialogue-based interaction so that the assistant can speak the current weather for a given location. The result is an embedded device with a microphone and speaker as input and output, interacting with users through voice and language processing algorithms.



Project Objective:

  • Wait quietly and wake up when its name is called
  • Understand your voice command and make reactions accordingly
  • Generate hip-hop lyrics on the fly and rap them over the machine-generated beat

Design & Testing

  1. Wake word detection
    We built a wake word listener that continually monitors the sounds around the device and activates when the speech matches its wake word. The audio is processed in chunks: we compute MFCC features of the speech in real time, feed the features into a neural network consisting of 20 gated recurrent units (GRUs), and make a prediction for every chunk to decide whether the wake word was spoken, as in the sketch below.
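    Below is a minimal sketch of the chunk-level classifier, assuming python_speech_features for the MFCC computation and Keras for the network; the 20-unit GRU layer follows the description above, while everything else (feature dimensions, optimizer) is an illustrative assumption.

      import numpy as np
      from python_speech_features import mfcc          # assumed MFCC library
      from tensorflow.keras.models import Sequential
      from tensorflow.keras.layers import GRU, Dense

      SAMPLE_RATE = 16000

      def build_listener(n_features=13):
          # Chunk-level classifier: MFCC frames -> 20 GRUs -> wake word probability
          model = Sequential([
              GRU(20, input_shape=(None, n_features)),
              Dense(1, activation='sigmoid'),
          ])
          model.compile(optimizer='adam', loss='binary_crossentropy')
          return model

      def chunk_probability(model, audio_chunk):
          # Compute MFCC features for the latest chunk and score it
          feats = mfcc(audio_chunk, samplerate=SAMPLE_RATE)     # (frames, 13)
          return float(model.predict(feats[None, ...], verbose=0)[0, 0])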

  2. Speech to Text & Text to Speech
    Once the wake word is detected, the speech-to-text function is triggered to record voice and convert it to text. Given the limited compute of the Raspberry Pi, we chose a mature online service for the recognition: by continuously streaming the audio chunks to the Google Cloud Speech API, we obtain the recognized text in real time, roughly as sketched below.
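    As a rough illustration, the streaming call looks like the following (a sketch based on the google-cloud-speech client samples; the chunk generator and language settings are assumptions matching our 16 kHz microphone stream):

      from google.cloud import speech

      def recognize_stream(audio_chunks, sample_rate=16000):
          # Stream raw LINEAR16 chunks to the API and return the first final result
          client = speech.SpeechClient()
          config = speech.RecognitionConfig(
              encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
              sample_rate_hertz=sample_rate,
              language_code="en-US",
          )
          streaming_config = speech.StreamingRecognitionConfig(config=config)
          requests = (speech.StreamingRecognizeRequest(audio_content=chunk)
                      for chunk in audio_chunks)
          for response in client.streaming_recognize(streaming_config, requests):
              for result in response.results:
                  if result.is_final:
                      return result.alternatives[0].transcript
          return ""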

  3. Topic understanding
    With the text recognized, the Raspberry Pi needs to understand its content so that it can react correctly to the user's command. We make use of the Microsoft Azure Language Understanding service (LUIS) to extract the user's intent and the corresponding entities from the recognized sentence. In our system we defined two major intents, "Freestyle" and "Weather"; other topics are currently ignored. LUIS also returns the entities in the sentence: for example, if we ask about the weather in a specific city, the city name comes back as an entity, which lets us process each intent flexibly. A sketch of the query follows.
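    A hedged sketch of the LUIS query, using the azure-cognitiveservices-language-luis runtime SDK; the app ID, key, and endpoint are placeholders, and the fields read here are the same ones intent_recognize() in the Code Appendix consumes:

      from azure.cognitiveservices.language.luis.runtime import LUISRuntimeClient
      from msrest.authentication import CognitiveServicesCredentials

      def understand(text, app_id, key,
                     endpoint="https://westus.api.cognitive.microsoft.com"):
          # Ask LUIS for the top-scoring intent and any entities in the sentence
          client = LUISRuntimeClient(endpoint, CognitiveServicesCredentials(key))
          result = client.prediction.resolve(app_id, text)
          intent = result.top_scoring_intent.intent       # e.g. "Freestyle" / "Weather"
          entities = [e.entity for e in result.entities]  # e.g. a city name
          return intent, entities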

  4. Rap lyrics generation
    In general, we used a character-level recurrent neural network with LSTM units. We chose a character-level representation because:
    • it does not require tokenization as a preprocessing step
    • it does not require unknown-word handling
    • it generates from a comparatively small vocabulary, using less memory
    • it can mimic grammatically correct sequences for a wide range of languages
    • it also includes punctuation, which makes the pauses in the lyrics more natural
    We chose the LSTM (long short-term memory) unit because it takes more context into consideration while avoiding vanishing gradients.

    We picked Eminem as the artist for our model to imitate because, according to a study conducted by the lyrics site Musixmatch, Eminem has the largest vocabulary in the music industry. We found a lyrics dataset scraped from LyricsFreak that includes 70 Eminem songs.

    We first combine those songs into a large 200k-character string with 50 unique characters, then cut the text into semi-redundant character sequences and vectorize them into input sequences; the target output for each sequence is the next character in the corpus.
    The model is built on the Keras text-generation example and consists of a linear stack of a long short-term memory layer and a regular fully-connected layer, as sketched below. Because of our limited computational resources, each epoch takes about 1 minute; the model was trained for 1200 epochs, and this part cost us 30 hours in total.
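    The sketch below shows the data preparation and model along the lines of the Keras example; the sequence length, stride, 128 LSTM units, and optimizer are the example's defaults and may differ from our exact settings.

      import numpy as np
      from tensorflow.keras.models import Sequential
      from tensorflow.keras.layers import LSTM, Dense

      MAXLEN, STEP = 40, 3  # sequence length and stride (Keras example defaults)

      def make_dataset(text, chars):
          # Cut the corpus into semi-redundant sequences; the target for each
          # sequence is the next character in the corpus.
          char_idx = {c: i for i, c in enumerate(chars)}
          starts = range(0, len(text) - MAXLEN, STEP)
          x = np.zeros((len(starts), MAXLEN, len(chars)), dtype=bool)
          y = np.zeros((len(starts), len(chars)), dtype=bool)
          for i, s in enumerate(starts):
              for t, ch in enumerate(text[s:s + MAXLEN]):
                  x[i, t, char_idx[ch]] = 1
              y[i, char_idx[text[s + MAXLEN]]] = 1
          return x, y

      def build_model(n_chars):
          # Linear stack: one LSTM layer plus a fully-connected softmax layer
          model = Sequential([
              LSTM(128, input_shape=(MAXLEN, n_chars)),
              Dense(n_chars, activation='softmax'),
          ])
          model.compile(optimizer='adam', loss='categorical_crossentropy')
          return model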

  5. Beat generation
    The Deep Learning AI specialization team provided a dataset in which the musical data is preprocessed so that we can render it in terms of musical "values." Each value can be considered a note, comprising a pitch and a duration.
    Similar to the text-generation model, the beat is also learned by an LSTM network. The architecture of the model is illustrated in the figure below. The difference between the lyrics model and the beat model is that the first input is randomly generated rather than given by the user; a sketch of the generation loop follows.
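    A minimal sketch of the sampling loop, assuming a trained Keras model with a softmax over one-hot note values (model and n_values are placeholders):

      import numpy as np

      def generate_beat(model, n_values, length=50):
          # Seed with one random value, then feed the growing one-hot sequence
          # back in; each sampled value encodes a pitch and a duration.
          seq = np.zeros((1, 1, n_values))
          seq[0, 0, np.random.randint(n_values)] = 1   # random first input
          notes = []
          for _ in range(length):
              probs = model.predict(seq, verbose=0)[0].astype("float64")
              probs /= probs.sum()                     # renormalize float error
              value = int(np.random.choice(n_values, p=probs))
              notes.append(value)
              step = np.zeros((1, 1, n_values))
              step[0, 0, value] = 1
              seq = np.concatenate([seq, step], axis=1)  # append the sample
          return notes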


Issues


Drawings

Final demo


Result

We basically followed our expected time schedule and accomplished the basic functions we proposed in the original proposal. In addition, this freestyle Pi supports some dialogue-based interactions; for example, it can tell you the weather of a given location and who the most handsome man in the world is, so we consider this project a success. Future work is needed to make this voice assistant more intelligent.


Future Work


Work Distribution


Project group picture


Xin

xf78@cornell.edu

Designed the overall software architecture (Just being himself).


Yangmengyuan Zhao

yz2453@cornell.edu

Lyrics and Beat Generation

Figure Design and Video Editing


Project Parts

Part                 From     Cost
Raspberry Pi         Lab      $0.00
Speaker              Lab      $0.00
PS3 Eye Microphone   Amazon   $8.53

Total: $8.53


Acknowledgements

We really appreciate everyone who has helped us build this project.

References

The Largest Vocabulary In Music
Keras LSTM example
Deep Learning AI Specialization
Pyalsaaudio Library
Mycroft-precise


Code Appendix


    import sys
    import time
    import random
    import queue
    import threading
    
    from termcolor import cprint
    from utils.audio import ResumableMicrophoneStream
    from utils.detect_queue import DetectQueue
    from utils.credentials import init_credentials
    
    from trigger_detector import TriggerDetector
    
    from speech_to_text import SpeechToText
    from lang_understand import LangUnderstand
    from text_to_speech import TextToSpeech
    from lyrics_generator import LyricsGenerator
    
    from tft_display import TFTDisplay
    
    # Audio recording parameters
    SAMPLE_RATE = 16000
    CHUNK_SIZE = int(SAMPLE_RATE / 10)  # 100ms
    STREAM_LIMIT = 5000
    
    
    class Andrew(object):
        """the rap voice assisstant
        """
        def __init__(self, detect_model="data/andrew2.net",
                            lyrics_model="data/keras_model_1200.h5",
                            lyrics_chars="data/chars.pkl"):
            # microphone
            self.mic = ResumableMicrophoneStream(SAMPLE_RATE, CHUNK_SIZE)
    
            # wake word detector
            self.detector = TriggerDetector(detect_model)
    
            # speech and language services
            self.speech_client = SpeechToText()
            self.luis = LangUnderstand()
            self.tts = TextToSpeech()
    
            # lyrics generator model
            self.lyrics_gen = LyricsGenerator(lyrics_model, lyrics_chars)
    
            self.pred_queue = DetectQueue(maxlen=5)
            self.is_wakeup = False
    
            # pytft display
            self.tft = TFTDisplay()
            self.tft_queue = queue.Queue()
            self.tft_thread = threading.Thread(target=self.tft_manage, args=())
            self.tft_thread.daemon = True
            self.tft_thread.start()
    
            self.notify("hi_there")
    
    
        def notify(self, topic="hi_there", is_async=False, audio_path="data/audio"):
            # Notify with local preset audio files
            from os.path import join, isfile
            audio_file = join(audio_path, f"{topic}.wav")
            if not isfile(audio_file):
                return
    
            self.tts.play_file(audio_file, is_async)
    
    
        def generate_rap(self, topic="", beat_path="data/beat"):
            """Generate rap and play
            """
            tts = self.tts
            lyrics_gen = self.lyrics_gen
    
            response = tts.generate_speech(f"hey, I can rap about {topic}")
            tts.play(response, True)
    
            # Generate based on topic
            lyrics_output = lyrics_gen.generate(topic)
    
            # Generate speech
            lyrics_speech = tts.generate_speech(lyrics_output)
    
            # Select beat
            beat_index = random.randint(0, 20)
    
            # Play beat and lyrics
            tts.play_file(f'{beat_path}/beat_{beat_index}.wav', True)
            tts.play(lyrics_speech)
    
        def get_weather_message(self, city="Ithaca"):
            import requests, os
            api_key = os.getenv('WEATHER_APIKEY')
            base_url = "https://api.openweathermap.org/data/2.5/weather?"
            city_name = f"{city},us"
            complete_url = f"{base_url}q={city_name}&units=imperial&APPID={api_key}"
            try:
                response = requests.get(complete_url)
                res = response.json()
                msg_weather = f"Today, it's {res['weather'][0]['description']} in {city}. "
                msg_temp = f"The temperature is {int(res['main']['temp'])} degrees."
                return msg_weather + msg_temp
            except (requests.RequestException, KeyError, ValueError):
                pass
    
            return ""
    
    
        def intent_recognize(self, text=""):
            """Recognize intent
            """
            luis = self.luis
            tts = self.tts
    
            # Get result from language understanding engine
            luis_result = luis.predict(text)
            intent = luis_result.top_scoring_intent.intent
    
            if intent == "Freestyle":
                entities = luis_result.entities
                entity_topic = "rap"
                if (len(entities) > 0):
                    entity = entities[0]
                    cprint(f'The topic is {entity.entity}', 'cyan')
                    entity_topic = entity.entity
                self.generate_rap(entity_topic)
    
            elif intent == "Weather":
                response = tts.generate_speech("I will tell you the weather in Ithaca.")
                tts.play(response)
    
                weather = self.get_weather_message()
                response = tts.generate_speech(weather)
                tts.play(response)
    
            else:
                self.notify("sorry")
    
    
        def tft_manage(self):
            """Manage TFT display through state
            """
            self.tft.display_text("Andrew is waking up")
            status = {'state': 'None'}
    
            while True:
                if status['state'] == 'wait':
                    self.tft.display_wave()
    
                elif status['state'] == 'listen':
                    self.tft.display_wave((0, 255, 0))
    
                # Update the status
                try:
                    update = self.tft_queue.get(block=False)
                    if update is not None:
                        status = update
    
                except queue.Empty:
                    continue
    
    
        def start(self):
            """Start listening and interacting
            """
            tft = self.tft
            tts = self.tts
    
            # Init stream
            with self.mic as stream:
    
                self.tft_queue.put({'state': 'listen'})
    
                while True:
                    if not self.is_wakeup:
                        stream.closed = False
    
                        while not stream.closed:
    
                            stream.audio_input = []
                            audio_gen = stream.generator()
    
                            for chunk in audio_gen:
                                if not self.is_wakeup:
    
                                    prob = self.detector.get_prediction(chunk)
    
                                    self.pred_queue.append(prob > 0.6)
                                    print('!' if prob > 0.6 else '.', end='', flush=True)
    
                                    if (self.pred_queue.count >= 2):
                                        self.notify("hi")
                                        cprint(' Trigger word detected! \n', 'magenta')
                                        self.pred_queue.clear()
                                        self.is_wakeup = True
                                        stream.pause()
                                        break
                    else:
                        cprint('Speech to text\n', 'green')
    
                        time.sleep(1)
                        stream.closed = False
    
                        try:
                            voice_command = self.speech_client.recognize(stream)
    
                            cprint(f'{voice_command}\n', 'yellow')
                            cprint('Recognition ended...\n', 'red')
    
                            stream.pause()
    
                            #tft.display_text(f'"{voice_command}"')
    
                            if ("goodbye" in voice_command):
                                self.notify("see_you")
                                sys.exit()
    
                            if ("sorry" in voice_command):
                                self.notify("its_ok")
    
                            else:
                                cprint('Recognize intents...', 'cyan')
                                self.intent_recognize(voice_command)
    
                        except Exception as e:
                            cprint(f'Error: {e}', 'red')
    
                        self.is_wakeup = False
    
    
    def main():
    
        # set credentials for cloud services
        init_credentials()
    
        # init and start andrew
        andrew = Andrew()
        andrew.start()
    
    
    if __name__ == "__main__":
        main()